1. INTRODUCTION

Professor Cook is to be congratulated for his 
ground-breaking work in dimension reduction in re- 
gression. The paper develops a general theoretical 
foundation for studying principal components and 
other dimension reduction methods in a regression 
context. This framework yields a basis for eluci- 
dating the strengths, weaknesses and relationships 
among the various dimension reduction methods, 
including ordinary least squares (OLS), principal 
components regression (PCR), sliced inverse regression
(SIR), parametric inverse regression and partial
least squares. The promising new method, princi- 
pal fitted components (PFC), appears to outperform 
some long-standing approaches such as PCR, OLS 
and SIR. Finally, as a result of this contribution, the 
standard approach to regression, with its emphasis 
on fixed predictors and the need to assume away 
the randomness of X, and the standard approach to 
principal components, with its focus on the correla- 
tion matrix rather than the covariance matrix, both 
seem to be under question. 

Specific contributions of Professor Cook's paper 
include the following: (1) It provides a theoretical 
foundation for the widely used principal components 
regression. (2) It resorts to a model and thus a like- 
lihood function, through the inverse regression of 
predictors given response, to study sufficient reduc- 
tion in a forward regression problem. Consequently, 
likelihood-based inferences can be developed, and
the inferential capabilities of dimension reduction 
are moved closer to mainstream regression method- 
ology. (3) It permits extension to categorical or mix- 
tures of continuous and categorical predictors, an 
area that most existing model-free dimension re- 
duction approaches do not handle effectively. (4) It 
seems to be applicable to problems where the num- 
ber of predictors exceeds the number of observa- 
tional units, which is perhaps one of the most chal- 
lenging current frontiers in statistical methodology. 
In general, we believe that this paper has paved the 
way for substantive research in dimension reduction, 
and that it will surely be the subject of much future 
application and elaboration. 

In what follows, we focus on two issues raised but 
not thoroughly addressed in Professor Cook's pa- 
per: (1) the role of predictor screening and a connec- 
tion with the recently proposed supervised principal 
components method and (2) dimension reduction in 
the presence of binary predictors. 

2. SUPERVISED PRINCIPAL COMPONENTS 

Professor Cook has suggested a combination of
predictor screening and principal fitted components
analysis when the number of predictors $p$ is large,
in particular when $n < p$. As Professor Cook noted,
traditional studies have based predictor screening on
the univariate forward regressions of $Y$ on individual
$X_j$, $j = 1, \ldots, p$. In view of the results presented
in this paper, it appears that the screening at the
outset should probably be based on the univariate
inverse regressions of $X_j$ on $f_y$. This in turn suggests
the following algorithm for PFC in conjunction with
predictor screening:

1. Compute the inverse univariate regressions of $X_j$
on $f_y$, for $j = 1, \ldots, p$.

2. Form a reduced matrix $X_\theta$ composed of only those
predictors whose regression coefficient on $f_y$ is
determined to surpass a level-of-significance threshold
$\theta$.

3. Obtain $\hat{\Gamma}$ using PFC based on the reduced $X_\theta$.

4. Pass $\hat{\Gamma}$ to the forward regression.
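The screening steps of this algorithm (steps 1 and 2) can be sketched as follows. This is only an illustration under stated assumptions: the function name `inverse_screen` is hypothetical, $f_y$ is taken to be the centered linear and quadratic terms of $y$, and the significance threshold is expressed as a critical value `f_crit` for the overall F statistic of each inverse regression. The PFC fit and the forward regression (steps 3 and 4) are not shown.

```python
import numpy as np

def inverse_screen(X, y, f_crit):
    """Screen predictors by the univariate inverse regressions of X_j on f_y.

    X is an (n, p) predictor matrix and y an (n,) response. Returns the indices
    of the retained predictors and the F statistic of every predictor.
    """
    n, p = X.shape
    # Assumed choice of f_y: centered linear and quadratic terms of y.
    F = np.column_stack([y - y.mean(), y**2 - np.mean(y**2)])
    r = F.shape[1]
    # Centering both sides is equivalent to fitting an intercept.
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(F, Xc, rcond=None)   # all p regressions at once
    rss = np.sum((Xc - F @ coef) ** 2, axis=0)      # residual sum of squares
    tss = np.sum(Xc ** 2, axis=0)                   # total sum of squares
    f_stat = ((tss - rss) / r) / (rss / (n - r - 1))
    keep = np.flatnonzero(f_stat > f_crit)
    return keep, f_stat
```

The columns `X[:, keep]` would then play the role of the reduced matrix passed to PFC.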

Recently, Bair et al. (2006) proposed an approach 
called supervised principal components (SPC), which 
can be described as follows: 

1. Compute the forward univariate regressions of $Y$ on
individual $X_j$, $j = 1, \ldots, p$.

2. Form a reduced matrix $X_\theta$ composed of only those
features whose regression coefficient exceeds a
threshold $\theta$ in absolute value.

3. Compute the first, or first few, principal compo- 
nents of the reduced data matrix. 

4. Use these principal component(s) in a regression 
model to predict the outcome. 
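The four SPC steps can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the implementation of Bair et al. (2006): the function name `supervised_pc` is hypothetical, and the univariate coefficient is taken to be the scaled inner product of each centered predictor with the centered response.

```python
import numpy as np

def supervised_pc(X, y, theta, n_comp=1):
    """Sketch of supervised principal components for an (n, p) matrix X."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Step 1: univariate forward-regression coefficients, scaled by the
    # predictor norm so they are comparable across predictors (assumption).
    s = Xc.T @ yc / np.linalg.norm(Xc, axis=0)
    # Step 2: keep features whose coefficient exceeds theta in absolute value.
    keep = np.flatnonzero(np.abs(s) > theta)
    if keep.size < n_comp:
        raise ValueError("threshold removes too many predictors")
    # Step 3: leading principal components of the reduced matrix via the SVD.
    U, d, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    scores = U[:, :n_comp] * d[:n_comp]
    # Step 4: least-squares regression of y on the component scores.
    design = np.column_stack([np.ones(n), scores])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return keep, scores, beta
```

In practice the threshold would be chosen by cross-validation or a significance level; here it is simply an input.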

The two approaches look strikingly similar, but
there are some essential differences. First, with Cook's
approach, predictor screening is based on univariate
inverse regressions, whereas forward univariate
regressions are employed in SPC. Of course, the two
screening approaches would frequently yield similar
reductions, since the significance of a regression of
$Y$ on $X_j$ will often be associated with significance
in the regression of $X_j$ on $Y$. Differences between
the two regressions arise in some cases, however. For
instance, if $X_j \mid Y$ is quadratic in $Y$ with no linear
trend, $X_j$ would be retained by an inverse quadratic
regression but deleted by the forward SPC screening,
since there would be no linear trend in $Y \mid X_j$.
Conversely, if $Y \mid X_j$ is quadratic in $X_j$, again with
no linear trend, it will be deleted by both procedures.
This is because the forward screening considers
only linear trend (and there is none) while the
inverse regression $X_j \mid Y$ would exhibit a null trend
with nonconstant variance.
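The first of these cases is easy to reproduce numerically. In the sketch below, all names and settings are illustrative assumptions: $Y$ is standard normal and $X_j = Y^2$ plus noise, so the population linear slope of $Y$ on $X_j$ is zero while the inverse quadratic regression of $X_j$ on $Y$ is strong. The forward t statistic still varies from sample to sample, so it should be read relative to the much larger inverse F statistic.

```python
import numpy as np

def forward_t(x, y):
    """t statistic for the slope in the simple linear regression of y on x."""
    n = len(y)
    xc, yc = x - x.mean(), y - y.mean()
    slope = xc @ yc / (xc @ xc)
    resid = yc - slope * xc
    se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
    return slope / se

def inverse_quadratic_f(x, y):
    """Overall F statistic for regressing x on (y, y^2) with an intercept."""
    n = len(y)
    F = np.column_stack([np.ones(n), y, y**2])
    coef, *_ = np.linalg.lstsq(F, x, rcond=None)
    rss = np.sum((x - F @ coef) ** 2)
    tss = np.sum((x - x.mean()) ** 2)
    return ((tss - rss) / 2) / (rss / (n - 3))

rng = np.random.default_rng(1)
y = rng.normal(size=500)
x = y**2 + 0.5 * rng.normal(size=500)   # quadratic in y, no linear trend
print("forward t:", forward_t(x, y))
print("inverse quadratic F:", inverse_quadratic_f(x, y))
```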

Second, we note that the underlying model assumed
by SPC can be viewed as a special case of
the PFC model given in Cook (2007). Consider (9)
and (10) of Bair et al. (2006), in which both the
response and the predictors are linked by a univariate
latent variable $U$:

$$Y = \beta_0 + \beta_1 U + \epsilon,$$

$$X_j = \alpha_{0j} + \alpha_{1j} U + \epsilon_j, \quad j = 1, \ldots, p,$$

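where $\epsilon$ and $\epsilon_j$ are assumed to have mean 0, are
independent and are also independent of $U$. Consequently,

$$X_j \mid (Y = y) = \left(\alpha_{0j} - \frac{\alpha_{1j}\beta_0}{\beta_1}\right) + \frac{\alpha_{1j}}{\beta_1}\, y + \left(\epsilon_j - \frac{\alpha_{1j}}{\beta_1}\,\epsilon\right) = \mu_j + \gamma_j y + \epsilon_j^*.$$

Let $\Gamma$ denote the $p \times 1$ vector with the $j$th element
$\gamma_j / c$, where $c = (\sum_{j=1}^{p} \gamma_j^2)^{1/2}$, $f_y = c(y - \bar{y})$, $\mu$
denotes the $p \times 1$ vector with the $j$th element $\mu_j + \gamma_j \bar{y}$
and $\epsilon^*$ denotes the $p \times 1$ vector with the $j$th element
$\epsilon_j^*$. We then obtain

$$X_y = \mu + \Gamma f_y + \epsilon^*. \tag{1}$$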

Model (1) is similar to model (5) in Cook (2007),
where the predictors $X_j$ given $Y$ are conditionally
independent since $\epsilon$ and $\epsilon_j$ are independent, although
$\operatorname{var}(X_j \mid Y = y)$ are not necessarily equal. Let $\sigma^2 =
\operatorname{var}(\epsilon)$, $\epsilon_x = (\epsilon_1, \ldots, \epsilon_p)^T$ and $\Sigma_{\epsilon_x} = \operatorname{var}(\epsilon_x)$, the
diagonal matrix with the $j$th diagonal element equal
to $\operatorname{var}(\epsilon_j)$. Then some algebra reveals that model
(1) can be viewed in the form of model (13) in Cook
(2007),

$$X_y = \mu + \Gamma f_y + \Gamma_0 \Omega_0 \varepsilon_0 + \Gamma \Omega \varepsilon,$$

where $\Gamma_0 \in \mathbb{R}^{p \times (p-1)}$ denotes an orthogonal completion
of $\Gamma$, $\Omega_0 = (\Gamma_0^T \Sigma_{\epsilon_x} \Gamma_0)^{1/2}$, $\Omega = (\Gamma^T \Sigma_{\epsilon_x} \Gamma + c^2 \sigma^2)^{1/2}$,
$\varepsilon$ is an error variable with mean 0 and
variance 1, $\varepsilon_0$ is a $(p - 1) \times 1$ vector of error variables
with mean 0 and identity variance, $\varepsilon$ and $\varepsilon_0$
are independent, and $\mu$, $\Gamma$ and $f_y$ are as defined before.
Clearly, the models discussed in Cook (2007)
are more flexible than model (1) assumed by SPC,
in the sense that they permit a more flexible choice
for $f_y$ and a more flexible variance structure.
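One step of the algebra can be written out as a sketch. It assumes our reading of the error term in model (1), namely that its $j$th element is $\epsilon_j - \gamma_j \epsilon$ with $\gamma_j = \alpha_{1j}/\beta_1$, and writes $\gamma = (\gamma_1, \ldots, \gamma_p)^T = c\,\Gamma$. Using the independence of $\epsilon$ and $\epsilon_x$, together with $\Gamma^T \Gamma = 1$ and $\Gamma_0^T \Gamma = 0$, the quadratic forms that define $\Omega$ and $\Omega_0$ follow directly:

```latex
\begin{aligned}
\operatorname{var}(\epsilon^*)
  &= \Sigma_{\epsilon_x} + \sigma^2 \gamma \gamma^T
   = \Sigma_{\epsilon_x} + c^2 \sigma^2\, \Gamma \Gamma^T, \\
\Gamma^T \operatorname{var}(\epsilon^*)\, \Gamma
  &= \Gamma^T \Sigma_{\epsilon_x} \Gamma + c^2 \sigma^2 = \Omega^2, \\
\Gamma_0^T \operatorname{var}(\epsilon^*)\, \Gamma_0
  &= \Gamma_0^T \Sigma_{\epsilon_x} \Gamma_0 = \Omega_0^2 .
\end{aligned}
```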

To compare the two approaches empirically, we
applied SPC and PFC to the logo data of Henderson
and Cote (1998), which were also examined
by Li and Nachtsheim (2006) in the context of
dimension reduction in regression. The data characterize
$n = 195$ company logos in terms of their overall
affect, $Y$, which is a composite of various subjective
reactions to a logo, and 22 design characteristics
$X$. Using a significance level of $\alpha = 0.05$
for predictor screening, univariate forward regressions
led to the dropping of eight predictors for the
SPC approach, while the univariate inverse regressions
on $f_y = (y - \bar{y}, y^2 - \overline{y^2})^T$ implicated the same
eight predictors plus one other. Since it was concluded
in Li and Nachtsheim (2006) that the data
exhibit one-dimensional structure ($d = 1$), we employed
SPC and PFC to extract the first reduction
variable $\hat{\Gamma}^T X$. Figure 1 shows the summary plot of
$Y$ versus $\hat{\Gamma}^T X$ with and without predictor screening.
For the full predictor case, PFC chose the fifth PC
direction, whereas SPC is always based on the first.
Nonetheless, the results are quite similar. Following
screening, PFC selected the first PC direction. The
similarities in the plots in this case are as expected.

Fig. 1. Sufficient summary plots for SPC and PFC with and without predictor screening.



3. BINARY PREDICTORS 

A key contribution of Professor Cook's paper has 
been to provide the logic on which to base an ex- 
tension of dimension reduction to categorical pre- 
dictors. Section 3.3 of Cook (2007) gives a brief pre- 
scription for such extensions, and we focus here on 
the use of principal components for binary predic- 
tors. 

In many applications such as in genetics and data 
mining, it is common to encounter data sets where 
all predictors are dichotomous. For instance, in quan- 
titative genetics, the genetic architecture is recorded 
by a sequence of binary markers. In a typical email 
spam filtering system, each email is denoted by a 
vector of binary features indicating whether a set of 
commonly occurring words is present. These exam- 
ples underscore the relevance of binary predictors in 
applications, yet existing dimension reduction ap- 
proaches have focused mainly on the continuous- 
predictor case. 



Following the formulation given in Section 3.3 of
Cook (2007), we assume that each predictor $X_j$,
$j = 1, \ldots, p$, given $Y = y$, follows a Bernoulli
distribution with parameter $p_j(y) = \Pr(X_j = 1 \mid y)$, and
all predictors are conditionally independent. The log
likelihood for a single $X_j$ at $y$, $x_{jy}$, is therefore

$$\log q_j(y) + x_{jy}(\mu_j + \gamma_j^T \nu_y), \tag{2}$$

where $q_j(y) = 1 - p_j(y) = 1/\{1 + \exp(\mu_j + \gamma_j^T \nu_y)\}$.
The overall log likelihood is then

$$\sum_y \sum_j \{\log q_j(y) + x_{jy}(\mu_j + \gamma_j^T \nu_y)\}. \tag{3}$$
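As a concrete reading of the overall log likelihood (3), the following sketch evaluates it for given parameter values. The function name and array layout are assumptions made for illustration: `X` is the $n \times p$ binary data matrix, `Gamma` is $p \times d$ with $\gamma_j^T$ as its $j$th row, and `Nu` stacks the $n$ vectors $\nu_y$ as rows.

```python
import numpy as np

def bernoulli_loglik(X, mu, Gamma, Nu):
    """Overall log likelihood (3) for conditionally independent binary predictors.

    X: (n, p) array of 0/1 values; mu: (p,); Gamma: (p, d); Nu: (n, d).
    """
    # Natural parameters mu_j + gamma_j^T nu_y, one row per observation y.
    eta = mu + Nu @ Gamma.T
    # log q_j(y) = -log(1 + exp(eta)); logaddexp avoids overflow for large eta.
    log_q = -np.logaddexp(0.0, eta)
    return np.sum(log_q + X * eta)
```

With all parameters set to zero, every entry contributes $\log(1/2)$, which gives a quick sanity check of the implementation.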

To obtain the maximum likelihood estimates of $\mu$,
$\nu$ and $\Gamma$, an iterative algorithm can be employed.
More specifically, fixing $\mu$ and $\Gamma$, we first estimate
the $n$ $\nu_y$'s by adding the log likelihood (2) over $j$ and
obtaining $\sum_j \{\log q_j(y) + x_{jy}(\mu_j + \gamma_j^T \nu_y)\}$. Given $\mu_j$
and $\gamma_j$, estimates can be obtained from a standard
logistic regression maximization, by treating $(x_{1y},
\ldots, x_{py})$ as the "response," $(\mu_1, \ldots, \mu_p)$ as the "offset"
and $\gamma_j^T$ as the $j$th value of a $d$-dimensional vector
of "predictors," and the logistic regression is fitted
through the origin. We estimate the $d$-dimensional
parameter vector $\nu_y$ based on $p$ observations, and
in total we fit $n$ logistic regressions. Next, to estimate
$\mu_j$ given fixed $\nu_y$ and $\Gamma$, we simply maximize
$\sum_y \{\log q_j(y) + x_{jy}(\mu_j + \gamma_j^T \nu_y)\}$. This corresponds to
a univariate logistic regression with intercept $\mu_j$ and
offset $\gamma_j^T \nu_y$, and we fit $p$ such logistic regressions in
total. Finally, fixing $\mu$ and $\nu$, we need to maximize
(3) over $\Gamma$ in a Grassmann manifold. Optimization
in Grassmann manifolds is discussed in Edelman,
Arias and Smith (1999).
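The two logistic-regression steps of this iteration can be sketched with a small Newton-Raphson solver. This is a minimal illustration, not the implementation used for the results reported here: the function names are hypothetical, the tiny ridge term on the Hessian is an added safeguard that is not part of the algorithm described above, and the Grassmann-manifold update of $\Gamma$ is omitted.

```python
import numpy as np

def logistic_offset_fit(Z, x, offset, n_iter=25, ridge=1e-8):
    """Newton-Raphson fit of logit Pr(x = 1) = offset + Z @ b (no intercept added)."""
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(offset + Z @ b)))
        grad = Z.T @ (x - prob)                      # score vector
        w = prob * (1.0 - prob)                      # Bernoulli variances
        # Ridge term (assumption) keeps the Hessian invertible.
        hess = Z.T @ (Z * w[:, None]) + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, grad)
        b += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

def update_nu(X, mu, Gamma):
    """For each observation y: response = row of X, predictors = rows of Gamma,
    offset = mu, fitted through the origin (n logistic regressions)."""
    return np.vstack([logistic_offset_fit(Gamma, X[i], mu)
                      for i in range(X.shape[0])])

def update_mu(X, Gamma, Nu):
    """For each predictor j: intercept-only logistic regression with
    offset gamma_j^T nu_y (p logistic regressions)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    return np.array([logistic_offset_fit(ones, X[:, j], Nu @ Gamma[j])[0]
                     for j in range(p)])
```

Alternating `update_nu` and `update_mu` with a separate update for $\Gamma$ gives one sweep of the iteration. When a row or column of the data is perfectly separated, these fits need not converge.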

We implemented the above iterative optimization
algorithm and tested the method with the following
simulation setup. The response $Y$ was generated as
a normal random variable with mean 0 and variance
$\sigma_Y^2$, and $X \mid (Y = y)$ was generated as a Bernoulli random
vector with the natural parameter $\eta_y \in \mathbb{R}^p$ and
the canonical link function $g$,

$$\eta_y = g(E(X \mid Y = y)) = \Gamma \beta f_y.$$

We fixed the sample size $n = 200$ and the number
of predictors $p = 20$. We first considered the
$d = 1$ case and examined different forms of $\Gamma$:
$\Gamma_1 = (1, \ldots, 1, 0, \ldots, 0)^T / \sqrt{10}$, with ten 1s and the
remaining 0s; $\Gamma_2 = (1, 1, 1, 0.5, 0.5, -0.5, -0.5, -1,
-1, -1, 0, \ldots, 0)^T / \sqrt{7}$; and $\Gamma_3 = (1, \ldots, 1, 0.5, \ldots,
0.5, -0.5, \ldots, -0.5, -1, \ldots, -1)^T / \sqrt{12.5}$, with each
element repeated five times. Here $\beta = 1$ and $f_y$ is the
centered linear term $y$. We also considered a $d = 2$ case,
by choosing $\Gamma_4 = ((1, \ldots, 1, 0, \ldots, 0)^T, (0, \ldots, 0, 1, \ldots,
1)^T) / \sqrt{10}$, with ten 1s in each direction; $\beta = \operatorname{diag}(1,
0.1)$; and $f_y = (y, y^2)^T$ centered.
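The data-generating step of this setup can be sketched as follows. The function name, the use of sample means to center $f_y$ and the random seed are illustrative assumptions; the four directions are coded from the definitions above.

```python
import numpy as np

def simulate_binary(n, Gamma, beta, sigma_y, rng, quadratic=False):
    """Draw Y ~ N(0, sigma_y^2) and binary X | Y with logit means Gamma beta f_y."""
    y = rng.normal(0.0, sigma_y, size=n)
    f = np.column_stack([y, y**2]) if quadratic else y[:, None]
    f = f - f.mean(axis=0)                 # centered f_y (sample centering: assumption)
    eta = f @ beta.T @ Gamma.T             # rows are eta_y = Gamma beta f_y
    prob = 1.0 / (1.0 + np.exp(-eta))      # inverse of the canonical (logit) link
    X = (rng.random(eta.shape) < prob).astype(float)
    return X, y

# The four directions used in the simulation (p = 20).
Gamma1 = np.r_[np.ones(10), np.zeros(10)][:, None] / np.sqrt(10)
Gamma2 = np.array([1, 1, 1, 0.5, 0.5, -0.5, -0.5, -1, -1, -1]
                  + [0] * 10)[:, None] / np.sqrt(7)
Gamma3 = np.repeat([1, 0.5, -0.5, -1], 5)[:, None] / np.sqrt(12.5)
Gamma4 = np.column_stack([np.r_[np.ones(10), np.zeros(10)],
                          np.r_[np.zeros(10), np.ones(10)]]) / np.sqrt(10)

rng = np.random.default_rng(0)
X1, y1 = simulate_binary(200, Gamma1, np.array([[1.0]]), 2.0, rng)
X4, y4 = simulate_binary(200, Gamma4, np.diag([1.0, 0.1]), 2.0, rng, quadratic=True)
```

The value `sigma_y = 2.0` is only a placeholder for the signal strength.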

Table 1 reports the average angles (in degrees)
between $\hat{\Gamma}$ and $\Gamma$ out of 50 data replications, as
$\sigma_Y$, the term that controls the "signal" strength,
increases. It is first noted that the estimation accuracy
increases as $\sigma_Y$ increases, as expected. Second, we
have observed that the optimization algorithm may
become quite unstable in some cases, as reflected by
the nonconvergence of the logistic fit when estimating
$\mu$ and $\nu$. This instability in turn has caused
poor estimation accuracy. For instance, in the case
of $\Gamma_4$, even with a relatively large $\sigma_Y$ value, the
average angle is large.
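For reference, an angle between an estimated and a true direction can be computed from principal angles, as in the sketch below. The function name is hypothetical. For $d = 1$ this is the usual angle between two vectors up to sign; for $d = 2$ the sketch reports the largest principal angle between the two subspaces, which is an assumption, since the definition used for Table 1 in that case is not stated here.

```python
import numpy as np

def subspace_angle_deg(A, B):
    """Largest principal angle, in degrees, between span(A) and span(B).

    A and B are (p, d) arrays whose columns span the two subspaces.
    """
    Qa, _ = np.linalg.qr(A)                # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                # orthonormal basis of span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against rounding error
    return np.degrees(np.arccos(cosines.min()))
```

Averaging this quantity over replications would give entries comparable in spirit to those of Table 1.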

To make the principal components for all binary 
predictors practically useful, a stable version of the 
iterative maximization algorithm is needed. One pos- 
sible solution is the majorization strategy as sug- 
gested by Kiers (2002). Future research along this 
direction seems warranted. 



Table 1 

Average of angles (in degrees) between $\hat{\Gamma}$ and $\Gamma$ given $\sigma_Y$,
based on 50 data replications